
4.x perf optimizations #84

Open
yuzawa-san wants to merge 84 commits into IABTechLab:4.X from yuzawa-san:4.X-perf-optimizations

Conversation

@yuzawa-san
Contributor

I have tried the 4.X branch out in a real production environment under real load. This allowed me to profile and kill off more hotspots. Here are the main changes:

  • optimize hot loops in AbstractLazilyEncodableSection: we can skip allocating and calling iterators and just use simple for loops
  • remove ManagedIntegerSet and ManagedFixedList. Both of these wrapped the values returned by the field getValue() methods so we could determine whether the returned value was mutated; in that case the parent field would be marked as dirty. However, we found we called getValue() many times on the same decoded field, and it returned a new wrapper on each call. I was never fully satisfied with my original implementation here, so I reworked it into a different architecture: I introduced a Dirtyable interface, meaning the values themselves can track mutations, and made the fields that return a collection return a Dirtyable implementation: IntegerSet, FixedIntegerList, FixedList.
  • this allowed me to clean up the class hierarchy for IntegerSet. Also, since this is now a concrete class, it uses virtual dispatch instead of the more expensive interface dispatch.
  • I also converted the method signatures to return FixedIntegerList instead of List<Integer>. FixedIntegerList is much more optimized: it stores values in a byte array and adds methods for unboxed access.
  • added tests
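To make the Dirtyable idea above concrete, here is a minimal sketch of how a collection can track its own mutations so the parent field no longer needs a per-call wrapper. The interface and class names follow the PR description, but the bodies are illustrative assumptions, not the actual library implementation:

```java
// Hypothetical sketch, not the real library code: a Dirtyable collection
// records mutations itself instead of being wrapped by the caller.
interface Dirtyable {
    boolean isDirty();
    void markClean();
}

final class FixedIntegerList implements Dirtyable {
    private final byte[] values; // packed storage, unboxed access
    private boolean dirty;

    FixedIntegerList(int size) {
        this.values = new byte[size];
    }

    public int getInt(int index) {
        return values[index]; // no boxing, no iterator allocation
    }

    public int setInt(int index, int value) {
        int previous = values[index];
        values[index] = (byte) value;
        dirty = true; // the value itself records the mutation
        return previous;
    }

    @Override
    public boolean isDirty() { return dirty; }

    @Override
    public void markClean() { dirty = false; }
}
```

With this shape, getValue() can hand back the same object on every call, and the encoder only needs to ask isDirty() before deciding whether to re-encode.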


@Kevin-P-Kerr Kevin-P-Kerr left a comment


My only concern is that FixedIntegerList can only represent (I believe) 8-bit values. Does that work for this use case?

@yuzawa-san
Contributor Author

@Kevin-P-Kerr yes, the GPP state specs only use 0, 1, and 2 as values (e.g. SensitiveDataProcessing, KnownChildSensitiveDataConsents). Theoretically it could have been packed even smaller, but byte is the smallest primitive with built-in casts to/from int.

public int setInt(int index, int value) {
    // NOTE: int 128 is prevented since it would get turned into byte -128
    if (value < 0 || value >= 128) {
        throw new IllegalArgumentException("FixedIntegerList only supports positive integers less than 128.");
    }
    // ... (remainder of method elided in the original snippet)
}
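The wraparound that the range check guards against is easy to demonstrate in isolation (illustrative snippet, not project code):

```java
public class ByteCastDemo {
    public static void main(String[] args) {
        // A Java byte is signed and 8 bits wide, so casting int 128 wraps
        // to -128; that is why values >= 128 must be rejected up front.
        System.out.println((byte) 128); // prints -128
        System.out.println((byte) 127); // prints 127: the largest safe value
    }
}
```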


This should be fine, as the most typical values are 0, 1, and 2, I believe.

@yuzawa-san
Contributor Author

The most recent changes:

  • cleaned up the maven configuration: use -SNAPSHOT instead of -RC1 (the latter can be used when doing a tagged release, but for now keep it as SNAPSHOT for convenience), added jacoco (coverage is ~95%), added spotless
  • fixed integer lists are now backed by an underlying bit string
  • unified bit string reading and writing
  • rewrote how fields are handled: the schema is now defined and initialized once, not on each decode/encode
  • generate initial field values only when they are needed. Previously we would create all of the collections only for them to be overwritten with whatever we parsed
  • optimized readInt and readLong operations to use fewer bit operations
  • more DRY: removed bespoke segment definitions, since all the traditional base64 segments have the same logic and thus can use the same constructor
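For readers unfamiliar with the bit string representation, here is a minimal sketch of reading an n-bit unsigned value out of a bit string packed into a byte array, in the spirit of the unified readInt described above. The class and method names are hypothetical, and this loop-per-bit version is for clarity rather than the fewer-bit-operations variant the PR implements:

```java
// Illustrative sketch, assuming MSB-first packing: bit 0 is the high bit
// of bits[0]. Not the library's actual API.
final class BitStringReader {
    private final byte[] bits;

    BitStringReader(byte[] bits) {
        this.bits = bits;
    }

    // Read 'length' bits starting at bit 'offset' as an unsigned int.
    int readInt(int offset, int length) {
        int result = 0;
        for (int i = 0; i < length; i++) {
            int bitIndex = offset + i;
            // Locate the byte (bitIndex / 8) and the bit within it,
            // masking with & 1 to discard sign extension from the shift.
            int bit = (bits[bitIndex >> 3] >> (7 - (bitIndex & 7))) & 1;
            result = (result << 1) | bit;
        }
        return result;
    }
}
```

Backing the fixed integer lists with such a bit string means decode can defer materializing values until they are actually requested.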
d3aca85

Benchmark                              Mode  Cnt      Score     Error   Units
Microbenchmark.run                     avgt   15   1494.297 ±  17.303   ns/op
Microbenchmark.run:gc.alloc.rate       avgt   15  14378.404 ± 182.814  MB/sec
Microbenchmark.run:gc.alloc.rate.norm  avgt   15   1877.334 ±  29.212    B/op
Microbenchmark.run:gc.count            avgt   15    729.000            counts
Microbenchmark.run:gc.time             avgt   15    273.000                ms

15x faster, 70% reduction in memory usage versus the 4.X branch.

This was my final round of optimizations; at this point, I consider this work complete. Once this is merged in, I will port the changes from master into the 4.X branch.
